Searching a Terabyte of Text Using Partial Replication
Authors
Abstract
The explosion of content in distributed information retrieval (IR) systems requires new mechanisms to attain timely and accurate retrieval of unstructured text. In this paper, we investigate using partial replication to search a terabyte of text in our distributed IR system. We use a replica selection database to direct queries to relevant replicas, which maintains query effectiveness while restricting some searches to a small percentage of the data to improve performance and scalability and to reduce network latency. We first investigate query locality with respect to time and replica size using real logs from THOMAS and Excite. Our evidence indicates that there is sufficient query locality to justify partial replication for information retrieval, and that partial replication can achieve better performance than caching queries, because the replica selection algorithm finds similarity between non-identical queries and thus increases observed locality. We then use a validated simulator to compare database partitioning to partial replication with load balancing, and find that partial replication is much more effective at decreasing query response time than partitioning, even with fewer resources, and that it requires only modest query locality. We also demonstrate average query response times under 10 seconds for a variety of workloads with partial replication on a terabyte text database.
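The routing idea described above can be illustrated with a minimal sketch. This is not the paper's actual replica selection algorithm; the scoring rule (term overlap between the query and a replica's content profile), the threshold, and all names here are assumptions made for illustration. A query is directed to the best-matching partial replica only when its estimated relevance is high enough; otherwise it falls back to the full collection.

```python
def route_query(query_terms, replica_profiles, threshold=0.5):
    """Return the best-matching replica name, or None to search the full database.

    replica_profiles maps a replica name to the set of terms it covers.
    """
    best_replica, best_score = None, 0.0
    query = set(query_terms)
    for replica, profile in replica_profiles.items():
        # Fraction of query terms covered by this replica's term profile.
        overlap = len(query & profile) / len(query) if query else 0.0
        if overlap > best_score:
            best_replica, best_score = replica, overlap
    # Only restrict the search to a replica when coverage is convincing.
    return best_replica if best_score >= threshold else None

# Hypothetical replicas: one covering legislative text, one covering web pages.
replicas = {
    "congress": {"bill", "senate", "budget", "vote"},
    "web": {"search", "ranking", "link", "anchor"},
}
print(route_query(["senate", "budget", "vote"], replicas))  # → congress
print(route_query(["genome", "protein"], replicas))         # → None (full search)
```

A real system would score replicas with an IR similarity measure rather than raw term overlap, but the control flow — route to a small replica when confident, otherwise search everything — is the same, and it is what makes non-identical but similar queries hit the same replica.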
Similar Resources
Scalable Text Retrieval for Large Digital Libraries
It is argued that digital libraries of the future will contain terabyte-scale collections of digital text and that full-text searching techniques will be required to operate over collections of this magnitude. Algorithms expected to be capable of scaling to these data sizes using clusters of modern workstations are described. First, basic indexing and retrieval algorithms operating at performan...
Language Models for Searching in Web Corpora
We describe our participation in the TREC 2004 Web and Terabyte tracks. For the web track, we employ mixture language models based on document full-text, incoming anchortext, and document titles, with a range of web-centric priors. We provide a detailed analysis of the effect on relevance of document length, URL structure, and link topology. The resulting web-centric priors are applied to three...
Keyphrase Extraction: Enhancing Lists
This paper proposes some modest improvements to Extractor, a state-of-the-art keyphrase extraction system, by using a terabyte-sized corpus to estimate the informativeness and semantic similarity of keyphrases. We present two techniques to improve the organization of keyphrase lists and remove outliers. The first is a simple ordering according to their occurrences in the corpus; the second ...
Experiments in Terabyte Searching, Genomic Retrieval and Novelty Detection for TREC 2004
In TREC2004, Dublin City University took part in three tracks, Terabyte (in collaboration with University College Dublin), Genomic and Novelty. In this paper we will discuss each track separately and present separate conclusions from this work. In addition, we present a general description of a text retrieval engine that we have developed in the last year to support our experiments into large s...
Amberfish at the TREC 2004 Terabyte Track
The TREC 2004 Terabyte Track evaluated information retrieval in largescale text collections, using a set of 25 million documents (426 GB). This paper gives an overview of our experiences with this collection and describes Amberfish, the text retrieval software used for the experiments.
Journal:
Volume Issue
Pages -
Publication date: 1999